Discipline Assessment 1 - Reef

Author

530318646

Published

March 25, 2025

Code
library(tidyverse)
library(tidyr)
library(ggplot2)
library(viridis)
library(cvTools)
library(dplyr)
library(plotly)
library(lubridate)
library(ncdf4)
library(CFtime)
library(sf)
library(caret)
library(randomForest)
library(reshape2)
library(RColorBrewer)
library(car)
library(kableExtra)
# knitr::write_bib(c(.packages(),
#                    "tidyverse", "ggplot2","dplyr", "plotly", "viridis",
#                    "cvTools", "lubridate", "ncdf4", "CFtime", "sf", "caret",
#                    "randomForest", "reshape2", "RColorBrewer", "car", "kableExtra"), "refs.bib")

Part 1

Hi Wei! I heard you’ve recently been wrapped up with researching a recent global coral bleaching phenomenon and that you’re hoping to study the environmental variables that may have triggered these unfortunate events.

I’ve actually recently come across this public dataset recording the bleaching events at 3351 locations in 81 countries between 1998-2017, created by our friend Sully and colleagues (Sully et al. 2019). I was also having a look at the paper they published on this data, particularly this bold claim:

“The highest probability of coral bleaching occurred at tropical midlatitude sites (15–20 degrees north and south of the Equator).”

So, according to them, coral reefs within these specified regions had a higher likelihood of coral bleaching, represented as Average_Bleaching (%) in their data. The implications are quite significant, so I want to explore this using their data because, particularly data from 2015-2017.

Approach

All coding was done in R Version 4.4.1 (R Core Team 2024). Additionally, data cleaning, modelling and visualisations were done using a combination of Base R and functions from the following packages;

Code
# read in csv data files
reefCheck <- read_csv("Reef_Check_with_cortad_variables_with_annual_rate_of_SST_change.csv")

# # check structure & stuff
# str(reefCheck)
# dim(reefCheck)
# head(reefCheck)

# data variables cleaning & reformatting
reef = reefCheck |>
  mutate(Date = dmy(Date),
         Year = year(Date),
         Year = as.factor(Year),
         Average_bleaching = as.integer(Average_bleaching)) |> 
  filter(Year %in% c("2015", "2016", "2017")) |>
  mutate(Latitude_Range = ifelse(abs(Latitude.Degrees) >= 15 & abs(Latitude.Degrees) <= 20, "Midlatitude", "Other"))

First, I created a new categorical variable to classify which reefs were located in this mid-latitude range of either “15-20 degrees North” or “15-20 degrees South”. Then, I wanted to visualise the average bleaching of the coral reefs over time to see if there appears to be a difference between the “mid-latitude” and “other” reefs. As you can see, it looks like there is more coral bleaching occuring in the reefs outside the specified region, contradictory to the claim made in the paper (Sully et al. 2019).

Code
# scatterplot to look at average bleaching in mid-lat vs other reefs

bleach_plot = ggplot(reef, aes(x = Date, y = Average_bleaching, color = Latitude_Range)) +
  geom_point() +
  labs(title = "Coral Bleaching Probabilities of Mid-Latitude & Other Locations Over Time",
       x = "Time (years)",
       y = "Average Bleaching") +
  scale_color_viridis(discrete = TRUE) +  
  theme_minimal() +
  theme(panel.background = element_rect(fill = "lightyellow", color = NA),
        plot.background = element_rect(fill = "lightyellow", color = NA))

ggplotly(bleach_plot, tooltip = c("x", "y", "fill"))    
Code
# making the world map for coral bleaching over the years
world_map = map_data("world")

latBand <- tibble(
  xmin = rep(-180, 2), xmax = rep(180, 2),
  ymin = c(15, -20), ymax = c(20, -15)
)

plotMap = ggplot() +
  geom_polygon(data = world_map, 
               aes(x = long, 
                   y = lat, 
                   group = group), 
               fill = "lightyellow", 
               alpha = 0.8) +
  geom_rect(data = latBand, 
            aes(xmin = xmin, 
                xmax = xmax, 
                ymin = ymin, 
                ymax = ymax), 
            fill = "blue", 
            alpha = 0.1) +
  geom_point(data = reef, 
             aes(x = Longitude.Degrees, 
                 y = Latitude.Degrees, 
                 size = Average_bleaching, 
                 color = Average_bleaching, 
                 frame = Year, 
                 text = paste("Reef:", 
                              Reef.Name, 
                              "<br>Average Bleaching:", 
                              Average_bleaching)), 
             alpha = 0.7) +
  scale_colour_viridis(option = "C") +
  theme_void() +
  labs(title = "Average Coral Bleaching From 2015-2017", x = NULL, y = NULL, 
  color = "Bleaching (%)") +
    theme(panel.background = element_rect(fill = "lightblue", color = NA),
          plot.background = element_rect(fill = "lightblue", color = NA),
          legend.text = element_text(size = 8),
          legend.title = element_text(size = 10, face = "bold")
          ) 

mapInt = ggplotly(plotMap, tooltip = "text") |>
  animation_opts(frame = 1000, transition = 500) |>
  animation_slider(currentvalue = list(prefix = "Year: "))

mapInt

Part 2

Hi Farhan! Wei told me all about your focus on the Great Barrier Reef (GBR) and using different environmental variables to predict future coral bleaching, how cool! I understand you’re concerned with whether using data from 4 years prior to the 2015-2017 bleaching events might be more useful than data collected during the events. In that case, I will be creating 2 different models to predict bleaching category based on variables from these 2 time periods, using these variables you mentioned:

  • Total Nitrogen
  • pH
  • Salt
  • Algae
  • Temperature

Approach - Random Forest

I cleaned and transformed the bleaching survey data, extracted the environmental variables from the NetCDF file and spatially joined based on proximity to create the final dataset joined_df. Then, I filtered bleaching category data for 2015-2017 \((n = 134)\), which was the only time period with bleaching data. For the second dataset, we joined the bleaching category data with variables from 4 years prior respectively (2011-2015, 2012-2016, 2013-2016) to create old_joined. The same libraries cited earlier were used in this stage, with the addition of 2 more:

Code
# read in csv data files
bleachData <- read_csv("bleachingSurveys-1.csv")

# # check structure for this data
# colnames(bleachData)
# dim(bleachData)
# head(bleachData)

# data cleaning & reformatting
bleachData_clean = bleachData |> mutate(
  year = factor(year), # 2015, 2016, 2017
  bleachCat = factor(bleachCat, levels = c(0, 1, 2, 3, 4))
) |>
  filter(!is.na(bleachCat)) |>
  na.omit()


# read in the nc file
eReefs_nc = ncdf4::nc_open(
"https://thredds.ereefs.aims.gov.au/thredds/dodsC/
GBR4_H2p0_B3p1_Cq3b_Dhnd/annual.nc?zc[1],
latitude[0:1:722],longitude[0:1:490],
temp[0:1:9][1][0:1:722][0:1:490],
TOTAL_NITROGEN[0:1:9][1][0:1:722][0:1:490],
MA_N[0:1:9][0:1:722][0:1:490],
PH[0:1:9][1][0:1:722][0:1:490],
salt[0:1:9][1][0:1:722][0:1:490],
time[0:1:9]")
#names(eReefs_nc$var)

# extract variables from nc file

# spatial data
lat = ncdf4::ncvar_get(eReefs_nc, "latitude")
long = ncdf4::ncvar_get(eReefs_nc, "longitude")

# time
time = ncdf4::ncvar_get(eReefs_nc, "time")
tunits = ncdf4::ncatt_get(eReefs_nc, "time", "units")
cf = CFtime::CFtime(tunits$value, calendar = "standard", time)
timestamps = CFtime::as_timestamp(cf)
timestamps = as.Date(timestamps, format = "%Y-%m-%d")


# predictors
total_nitrogen = ncdf4::ncvar_get(eReefs_nc, "TOTAL_NITROGEN")
pH = ncdf4::ncvar_get(eReefs_nc, "PH")
salt = ncdf4::ncvar_get(eReefs_nc, "salt")
algae = ncdf4::ncvar_get(eReefs_nc, "MA_N")
temp = ncdf4::ncvar_get(eReefs_nc, "temp")

# # check
# summary(total_nitrogen)
# summary(pH)
# summary(salt)
# summary(algae)
# summary(temp)

# make a df for these extracted variables
eReefs_df = expand.grid(long = long, lat = lat, time = timestamps) |>
  mutate(
    total_nitrogen = as.vector(total_nitrogen),
    pH = as.vector(pH),
    algae = as.vector(algae),
    salt = as.vector(salt),
    temp = as.vector(temp),
         )

# PELASE NOTE that i had to save joined data as a static file due to the long processing time of the st_join(), but the code i used to join the data is here and commented:

# bleaching_sf = st_as_sf(bleachData_clean, coords = c("longitude", "latitude"), crs = 4326)
# eReefs_sf = st_as_sf(eReefs_df, coords = c("long", "lat"), crs = 4326)
# 
# joined_sf = st_join(bleaching_sf, eReefs_sf, join = st_is_within_distance, dist = 1000, left = TRUE)

# save as rdata file in case i can't load it again
# full_joined = joined_sf |>
#   as.data.frame()
# 
# save(full_joined, file = "joined_data.RData")

load("joined_data.RData")

# current data
joined_df = full_joined |>
  filter(year(surveyDate) == year(time)) |>
  select(pH, total_nitrogen, algae, salt, temp, geometry, bleachCat) |>
    filter(
    !is.nan(pH),
    !is.nan(total_nitrogen),
    !is.nan(temp),
    !is.nan(algae),
    !is.nan(salt),
    !is.nan(bleachCat))


# old data & missing values
old_joined = full_joined |>
    filter(
      (year(time) == 2011 & year(surveyDate) == 2015) |
      (year(time) == 2012 & year(surveyDate) == 2016) |
      (year(time) == 2013 & year(surveyDate) == 2017)) |>
    select(pH, total_nitrogen, algae, salt, temp, geometry, bleachCat) |>
    filter(
    !is.nan(pH),
    !is.nan(total_nitrogen),
    !is.nan(temp),
    !is.nan(algae),
    !is.nan(salt),
    !is.nan(bleachCat))
  

# missing values and such
joined_df = joined_df |> filter(
  !is.nan(pH),
  !is.nan(total_nitrogen),
  !is.nan(temp),
  !is.nan(algae),
  !is.nan(salt))

After looking at this correlation matrix heatmap, we may have to worry about the multicollinearity between some of our predictors. For example, temperature is highly negatively correlated with pH and nitrogen (where pH and nitrogen are also positively correlated). However, this is because these variables have biochemical relationships so it is expected that they correlate with each-other such as nitrogen and temperature (Ana Alexandre 2020).

Code
numeric_data = data.frame(
  total_nitrogen = joined_df$total_nitrogen,
  pH = joined_df$pH,
  algae = joined_df$algae,
  salt = joined_df$salt,
  temp = joined_df$temp
)

cor_matrix = cor(numeric_data, use = "complete.obs")

melted_cor_matrix <- melt(cor_matrix)
melted_cor_matrix <- melted_cor_matrix[melted_cor_matrix$Var1 != melted_cor_matrix$Var2, ]
melted_cor_matrix <- melted_cor_matrix[as.numeric(melted_cor_matrix$Var1) < as.numeric(melted_cor_matrix$Var2), ]

corr_mat = ggplot(melted_cor_matrix, aes(Var1, Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_viridis_c(option = "D", limits = c(-1, 1)) +  # Use the viridis colour scale
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  coord_fixed() +
  labs(x = "", y = "", title = "Correlation Matrix") +
  theme(axis.text.x = element_text(size = 12), 
        axis.text.y = element_text(size = 12),
        panel.background = element_rect(fill = "lightyellow", color = NA),
        plot.background = element_rect(fill = "lightyellow", color = NA))

ggplotly(corr_mat)

Additionally, our target variable bleaching category which is an ordinal variable with 5 levels (0, 1, 2, 3, 4) has a small and discrete range, with the distribution appearing to look not normally distributed and more positively skewed. This may be due to the small range of the variable and the fact that coral bleaching isn’t particularly wide-spread (Great Barrier Reef Marine Park Authority 2025).

Code
bleach_hist = ggplot(joined_df, aes(x = factor(bleachCat))) +
  geom_bar(fill = "#440154", color = "black") +
  labs(title = "Bleach Category Distribution (2015-2017)",
       x = "Bleaching Category", 
       y = "Frequency") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5),
        panel.background = element_rect(fill = "lightyellow", color = NA),
        plot.background = element_rect(fill = "lightyellow", color = NA)) +
  scale_x_discrete(limits = c("0", "1", "2", "3", "4"))

ggplotly(bleach_hist)

Using the first dataset with the contemporary variables, we trained the first random forest classification model to predict bleaching category in 2015-2017. To compute the performance metrics, I implemented a 10-fold cross-validation; I went with 10-fold rather than 5-fold because our dataset was not very large (134 observations).

Code
# RF model with contemporary data (2015-2017)
set.seed(3888)

modelVars = c("pH", "total_nitrogen", "algae", "salt", "temp", "bleachCat")
modelDat_contemp = joined_df |> select(all_of(modelVars))

n = nrow(modelDat_contemp)
train_index = sample(1:n, size = 0.7 * n)

training_data = modelDat_contemp[train_index, ]
testing_data = modelDat_contemp[-train_index, ]

# model
rf_model = randomForest(
  bleachCat ~ pH + total_nitrogen + algae + salt + temp, 
  data = training_data,
  ntree = 500,
  importance = TRUE)
y_pred = predict(rf_model, newdata = testing_data)

# confmat
unique_classes = sort(unique(modelDat_contemp$bleachCat))
num_classes = length(unique_classes)
cm_contemp = table(Predicted = y_pred, Actual = testing_data$bleachCat)
full_cm = matrix(0, nrow = num_classes, 
                  ncol = num_classes, 
                  dimnames = list(unique_classes, unique_classes))
full_cm[rownames(cm_contemp), colnames(cm_contemp)] <- cm_contemp
cm_contemp = cm_contemp + full_cm
Code
# confmat
kable(cm_contemp, caption = "10-Fold Confusion Matrix for RF (Contemporary Data)")
10-Fold Confusion Matrix for RF (Contemporary Data)
0 1 2 3 4
0 20 10 4 4 2
1 0 6 0 0 0
2 0 2 2 6 0
3 0 2 4 10 4
4 0 0 0 2 4

Now using the dataset with the old variables (4 years prior), we trained the second random forest classification model to predict bleaching category in 2015-2017. To compute the performance metrics, I once again used a 10-fold cross-validation for the same reasons.

Code
# RF model with old data (2011-2013)
set.seed(3888)

modelVars = c("pH", "total_nitrogen", "algae", "salt", "temp", "bleachCat")
modelDat_old = old_joined |> select(all_of(modelVars))

n = nrow(modelDat_old)
train_index = sample(1:n, size = 0.7 * n)

training_data = modelDat_old[train_index, ]
testing_data = modelDat_old[-train_index, ]

# model
rf_model = randomForest(
  bleachCat ~ pH + total_nitrogen + algae + salt + temp, 
  data = training_data,
  ntree = 500,
  importance = TRUE
)
y_pred = predict(rf_model, newdata = testing_data)

# confmat
unique_classes = sort(unique(modelDat_old$bleachCat))
num_classes = length(unique_classes)
cm_old = table(Predicted = y_pred, Actual = testing_data$bleachCat)
full_cm = matrix(0, nrow = num_classes, 
                  ncol = num_classes, 
                  dimnames = list(unique_classes, unique_classes))
full_cm[rownames(cm_old), colnames(cm_old)] <- cm_old
cm_old = cm_old + full_cm
Code
# mean cofmat
kable(cm_old, caption = "10-Fold Confusion Matrix for RF (Old Variables)")
10-Fold Confusion Matrix for RF (Old Variables)
0 1 2 3 4
0 20 10 4 4 2
1 0 6 0 0 0
2 0 2 2 8 0
3 0 2 4 8 4
4 0 0 0 2 4

Based on the performance metrics calculated, it seems that the model computed with the old variables performed worse than the model with contemporary variables, albeit not significantly better

  • Accuracy: proportion of correct predictions: \(\frac{TP+TN}{TP+TN+FP+FN}\)
  • Precision: precision of positive predictions that were correct; \(\frac{TP}{TP+TN}\)
  • Recall: aka sensitivity, so proportion of true positives that were correctly identified: \(\frac{TP}{TP+FP}\)
  • F1: a better accuracy metric for RF (and my bleachCat imbalanced dataset), synthesizes precision and recall to make an optimised metric; \(\text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}\)
Code
# making a function that will calculate metrics
calculate_metrics = function(cm) {
  TP = diag(cm) # TPs are diagonal
  FP = colSums(cm) - TP # FPs are column sums - diagonal
  FN = rowSums(cm) - TP # FNs are row sums - diagonal
  TN = sum(cm) - (TP + FP + FN) # TNs are the total sum - others
  
  accuracy = round(sum(TP) / sum(cm), 2)
  precision = round(TP / (TP + FP), 2)
  recall = round(TP / (TP + FN), 2)
  f1 = round(2 * (precision * recall) / (precision + recall), 2)

  metrics = data.frame(
    Class = rownames(cm),
    Precision = precision,
    Recall = recall,
    F1 = f1,
    Accuracy = accuracy
  )
  
  # Calculate column averages, excluding the "Class" column
  col_averages = colMeans(metrics[, -1], na.rm = TRUE)  # Exclude the "Class" column
  
  # Add a row for averages at the bottom of the table
  avg_row = c("Average", col_averages)
  metrics = rbind(metrics, avg_row)
  return(metrics)
}

metrics_contemp = calculate_metrics(cm_contemp)
kable(metrics_contemp, caption = "Contemporary Model Metrics", row.names = FALSE)
Contemporary Model Metrics
Class Precision Recall F1 Accuracy
0 1 0.5 0.67 0.51
1 0.3 1 0.46 0.51
2 0.2 0.2 0.2 0.51
3 0.45 0.5 0.47 0.51
4 0.4 0.67 0.5 0.51
Average 0.47 0.574 0.46 0.51
Code
# old model metrics
metrics_old = calculate_metrics(cm_old)
kable(metrics_old, caption = "Old Model Metrics", row.names = FALSE)
Old Model Metrics
Class Precision Recall F1 Accuracy
0 1 0.5 0.67 0.49
1 0.3 1 0.46 0.49
2 0.2 0.17 0.18 0.49
3 0.36 0.44 0.4 0.49
4 0.4 0.67 0.5 0.49
Average 0.452 0.556 0.442 0.49

Evaluation

Before we make our conclusion, there are some limitations to consider that need to be addressed;

  • Data Quality and Missing Values: missing values in environmental variables were filtered, which reduced the sample size significantly and possibly introduced bias new sources of bias (sampling bias, imputation bias, etc.).

  • Random Forest: although effective, RF models has its own limitations. It’s “black box” nature makes it less interpretable than simpler models like logistic regression and RFs are also jsut prone to over-fitting. Other alternatives could be using support vector machines (SVM), neural networks or logistic regressions.

Conclusion

In conclusion, the results point towards working with contemporary environmental variables as they provide a slightly better model for predicting coral bleaching than data from four years prior. It also makes more sense that this model would perform better as the data is collected in the same time as the bleaching data. So, real-time monitoring and immediate environmental conditions seem to be more crucial for prediction than historical trends alone. However, my filtered dataset was quite small so using a larger sample and even including more historical data (>4 years) could have made predictions better.

However, my results are not the end all be all! Here are some recommendations for future directions:

  • Additional predictors in models, such as climate anomalies (e.g., El Niño) and local reef characteristics to improve predictive accuracy.
  • Compute multiple models from both datasets and see if model choice influenced results (comparing 2 models only may not have been sufficient)
  • Including more data even if it had missing values (e.g. assigning averages to missing values, i=only excluding fully missing rows and duplicats, etc.)
Code
ncdf4::nc_close(eReefs_nc)

References

Alfons, Andreas. 2024. cvTools: Cross-Validation Tools for Regression Models. https://CRAN.R-project.org/package=cvTools.
Ana Alexandre, Paul W. Hill, Raquel Quintã. 2020. “Ocean Warming Increases the Nitrogen Demand and the Uptake of Organic Nitrogen of the Globally Distributed Seagrass Zostera Marina.” Functional Ecology. https://doi.org/10.1111/1365-2435.13576.
Breiman, Leo, Adele Cutler, Andy Liaw, and Matthew Wiener. 2024. randomForest: Breiman and Cutlers Random Forests for Classification and Regression. https://www.stat.berkeley.edu/~breiman/RandomForests/.
Garnier, Simon. 2024. Viridis: Colorblind-Friendly Color Maps for r. https://sjmgarnier.github.io/viridis/.
Great Barrier Reef Marine Park Authority. 2025. “Coral Bleaching Categories.” https://www2.gbrmpa.gov.au/learn/reef-health/coral-bleaching-categories.
Kuhn, Max. 2024. Caret: Classification and Regression Training. https://github.com/topepo/caret/.
Neuwirth, Erich. 2022. RColorBrewer: ColorBrewer Palettes. https://CRAN.R-project.org/package=RColorBrewer.
Pebesma, Edzer. 2024. Sf: Simple Features for r. https://r-spatial.github.io/sf/.
Pierce, David. 2024. Ncdf4: Interface to Unidata netCDF (Version 4 or Earlier) Format Data Files. https://cirrus.ucsd.edu/~pierce/ncdf/.
R Core Team. 2024. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Sievert, Carson, Chris Parmer, Toby Hocking, Scott Chamberlain, Karthik Ram, Marianne Corvellec, and Pedro Despouy. 2024. Plotly: Create Interactive Web Graphics via Plotly.js. https://plotly-r.com.
Spinu, Vitalie, Garrett Grolemund, and Hadley Wickham. 2024. Lubridate: Make Dealing with Dates a Little Easier. https://lubridate.tidyverse.org.
Sully, Sarah, Deron E. Burkepile, Michelle K. Donovan, et al. 2019. “A Global Analysis of Coral Bleaching over the Past Two Decades.” Nature Communications 10: 1264. https://doi.org/10.1038/s41467-019-09238-2.
Van Laake, Patrick. 2025. CFtime: Using CF-Compliant Calendars with Climate Projection Data. https://github.com/pvanlaake/CFtime.
Wickham, Hadley. 2020. Reshape2: Flexibly Reshape Data: A Reboot of the Reshape Package. https://github.com/hadley/reshape.
———. 2023a. Stringr: Simple, Consistent Wrappers for Common String Operations. https://stringr.tidyverse.org.
———. 2023b. Tidyverse: Easily Install and Load the Tidyverse. https://tidyverse.tidyverse.org.
Wickham, Hadley, Winston Chang, Lionel Henry, Thomas Lin Pedersen, Kohske Takahashi, Claus Wilke, Kara Woo, Hiroaki Yutani, Dewey Dunnington, and Teun van den Brand. 2024. Ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. https://ggplot2.tidyverse.org.
Wickham, Hadley, Romain François, Lionel Henry, Kirill Müller, and Davis Vaughan. 2023. Dplyr: A Grammar of Data Manipulation. https://dplyr.tidyverse.org.
Wickham, Hadley, Davis Vaughan, and Maximilian Girlich. 2024. Tidyr: Tidy Messy Data. https://tidyr.tidyverse.org.